An Efficient Approach to Encoding Context for Spoken Language Understanding

2018-11-08

NLP, NLU

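SLU is the foundation of task-oriented dialogue systems. This paper proposes an SLU model that models the dialogue history: an RNN encodes the dialogue context, which then helps interpret the current utterance and can also be used for DST (dialogue state tracking).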
paper link
dataset link

Introduction

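In a task-oriented dialogue system, the SLU module parses the user's natural-language utterance into semantic frames consisting of the intent, dialogue acts, and slots (example figure not included).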

The paper models intent, dialogue act, and slot prediction jointly, which is currently the most common approach.

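Most previous SLU research has focused on single-turn language understanding, where the system (agent) receives only the current user utterance, an external knowledge base, and frame-based context. Task-oriented dialogue, however, involves multiple turns of interaction between user and system to accomplish the user goal. Multi-turn SLU faces the following challenge: both the user and the system may refer to entities introduced in earlier turns, which creates ambiguity. For example, "three" can denote a date, a time, a number of movie tickets, or a restaurant rating, depending on the context. Using the user and system utterances from previous turns has been shown to resolve this problem. Most of that work, however, uses only the previous system utterance, whereas memory networks model the entire dialogue history.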

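In essence, memory-network-based approaches encode the user and system utterances of previous turns, for example with an RNN. These memory embeddings together serve as a context vector, which is then used to predict the SLU outputs. Concretely, the context can be formed by computing attention scores between the current user utterance and the memory embeddings, or by encoding the memory embeddings with an RNN.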

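Although memory networks improve accuracy, they are computationally inefficient, for the following reasons:

At every turn, they must process the natural-language utterances of all previous turns.
The dialogue context could potentially be obtained from the dialogue state tracker. Using a separate SLU-specific network instead of reusing the DST's context doubles the computation.
Memory networks encode the system's natural-language utterances and ignore the system dialogue acts; both carry the same information, but dialogue acts are more structured and have fewer types.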

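This paper proposes a more efficient way to encode dialogue context. Its main contributions are twofold:

It encodes the system dialogue acts directly instead of the system's natural-language utterance. This makes it possible to reuse the dialogue manager (DM) output to obtain the context.
It encodes the context with a hierarchical RNN that processes one turn per time step. This reduces computation without degrading performance.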

Our representation of dialogue context is similar to those used in dialogue state tracking models [17, 18, 19], thus enabling the sharing of context representation between SLU and DST.

Approach

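Model overview: assume each dialogue has T turns, and each turn consists of a user natural-language utterance and a set of system dialogue acts (note that an act here can contain at most one slot, so a single utterance may correspond to several acts). The overall architecture is shown in a figure in the original post (not included).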

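For each turn t, the system act encoder produces an embedding $a^{t}$ of the input set of system dialogue acts $A^{t}$, and the utterance encoder encodes the user's natural-language utterance into $u^{t}$. The dialogue encoder is an RNN: at the current time step it takes $a^{t}$ and $u^{t}$ as input and, together with the previous hidden state $s^{t-1}$, produces the dialogue context representation $o^{t}$ while updating the hidden state to $s^{t}$. $o^{t}$ is used for user intent classification and dialogue act classification. The token-level outputs of the utterance encoder, $u_{o}^{t}$, are the input to the slot tagger, whose job is to extract slot values from the user's natural-language utterance.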

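Both the utterance encoder and the slot tagger are bidirectional RNNs. In addition to the inputs described above, both also receive the context vector $o^{t}$ as an additional input; the details are given below.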

System Act Encoder

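The System Act Encoder encodes the system dialogue acts at turn t into $a^{t}$. The encoding is independent of the order in which the acts appear, unlike a natural-language-based encoding, which implicitly captures order.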

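Each act consists of an act type and optional slot and value parameters. The authors divide acts into two categories:

Acts with one slot (an act has at most one slot), with or without a slot value: request(time), negate(time='6pm')
Acts without a slot: greeting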

_Note that the same dialogue act can appear in the dialogue with or without an associated slot (negate(time=‘6 pm’) versus negate)._

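Definitions:

$A_{sys}$: the set of all system acts
$a^{t}_{slot}(s)$: binary vector of length $\left | A_{sys} \right |$ indicating the acts that carry slot s; slot values are not included
$a^{t}_{ns}$: binary vector of length $\left | A_{sys} \right |$ indicating the acts without a slot
$e_{s}$: embedding for slot s
$S^{t}$: the set of slots at turn t

The System Act Encoder is essentially a fully connected network (the original illustration of its structure is not included).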
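Since the structure itself is missing here, the following is only a minimal sketch of how such an encoder could be wired from the definitions above. It assumes (this is not confirmed by the text) that each $a^{t}_{slot}(s)$ is concatenated with $e_{s}$, passed through one ReLU layer, averaged over the slots in $S^{t}$, concatenated with $a^{t}_{ns}$, and passed through a second ReLU layer. The act inventory, slot embeddings, layer width, and random weights are all hypothetical placeholders.

```python
import random

random.seed(0)

ACTS = ["request", "negate", "greeting", "confirm"]  # hypothetical A_sys
SLOT_EMB = {"time": [0.1, -0.2, 0.3], "date": [0.0, 0.5, -0.1]}  # hypothetical e_s
HIDDEN = 5  # hypothetical layer width


def binary(acts):
    """Binary vector of length |A_sys| marking which act types are present."""
    return [1.0 if a in acts else 0.0 for a in ACTS]


def dense_relu(weights, bias, x):
    """One fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def rand_layer(n_out, n_in):
    """Random weights and zero biases, standing in for trained parameters."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


W1, B1 = rand_layer(HIDDEN, len(ACTS) + 3)       # input: a_slot(s) concat e_s
W2, B2 = rand_layer(HIDDEN, HIDDEN + len(ACTS))  # input: pooled slots concat a_ns


def encode_system_acts(acts):
    """acts: list of (act_type, slot or None). Slot values are ignored."""
    by_slot, no_slot = {}, set()
    for act, slot in acts:
        if slot is None:
            no_slot.add(act)
        else:
            by_slot.setdefault(slot, set()).add(act)
    # One hidden vector per slot, then an order-independent mean over S^t.
    per_slot = [dense_relu(W1, B1, binary(a) + SLOT_EMB[s]) for s, a in by_slot.items()]
    if per_slot:
        pooled = [sum(col) / len(per_slot) for col in zip(*per_slot)]
    else:
        pooled = [0.0] * HIDDEN
    return dense_relu(W2, B2, pooled + binary(no_slot))
```

Because the per-slot vectors are averaged and the act vectors are binary indicators, permuting the input acts leaves the result unchanged, which matches the order-independence property described above.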

Utterance Encoder

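The Utterance Encoder produces representations of the user's token sequence. Its input is the user's natural-language token sequence (with SOS and EOS tokens added at the start and end), and its output is the corresponding token-level representations.
Definitions:

$x^{t}=\left \{ x^{t}_{m}\in R^{u_{d}}, \forall 0 \leq m< M^{t} \right \}$: token embeddings of the user utterance
$M^{t}$: length of the user's token sequence at turn t
$u^{t} \in R^{2d_{u}}$: representation of the whole user utterance
$u_{o}^{t} =\left \{ u_{o,m}^{t}\in R^{2d_{u}}, \forall 0 \leq m< M^{t} \right \}$: representations of the individual input tokens

The Utterance Encoder is essentially a single-layer bidirectional GRU:
$$u^{t}, u_{o}^{t}=BRNN_{GRU}(x^{t}) \qquad (5)$$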

Dialogue Encoder

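The Dialogue Encoder is a unidirectional GRU RNN in which each time step corresponds to one dialogue turn; its purpose is to produce a context representation for every turn. It takes $a^{t} \oplus u^{t}$ as input and, combined with the previous turn's hidden state $s^{t-1}$, produces the current turn's output $o^{t}$ and hidden state $s^{t}$ (the two are identical for a GRU cell). $o^{t}$ is the dialogue context representation at turn t.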
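The turn-level recurrence can be sketched as follows. This is a plain-Python illustration using one common GRU formulation without bias terms; the class and function names, the dimensions, and the random weights are hypothetical, and a real implementation would use a deep learning library.

```python
import math
import random

random.seed(0)


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.3, 0.3) for _ in range(cols)] for _ in range(rows)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """GRU cell without bias terms (omitted for brevity)."""

    def __init__(self, input_size, hidden_size):
        n = input_size + hidden_size
        self.w_z = rand_matrix(hidden_size, n)  # update gate
        self.w_r = rand_matrix(hidden_size, n)  # reset gate
        self.w_h = rand_matrix(hidden_size, n)  # candidate state

    def step(self, x, h):
        z = [sigmoid(v) for v in matvec(self.w_z, x + h)]
        r = [sigmoid(v) for v in matvec(self.w_r, x + h)]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in matvec(self.w_h, x + gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def encode_dialogue(turns, cell, hidden_size):
    """turns: list of (a_t, u_t) vector pairs. Returns [o^1, ..., o^T]."""
    state = [0.0] * hidden_size  # initial state before the first turn
    outputs = []
    for a_t, u_t in turns:
        state = cell.step(a_t + u_t, state)  # input is a^t concatenated with u^t
        outputs.append(state)                # for a GRU, o^t equals s^t
    return outputs
```

Each turn costs exactly one call to `step`, regardless of how long the dialogue has been running.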

This encoding is more efficient than memory networks, which must process the entire dialogue history at every turn.
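A rough way to see the difference is to count utterance-encoder passes over a whole dialogue. The counting model below is a simplifying assumption, not a measurement from the paper: it charges a memory-based model one pass per stored previous turn plus one for the current utterance, and the recurrent model one pass per turn. Both function names are hypothetical.

```python
def memory_encoder_passes(num_turns, memory_size):
    """Passes if every turn re-encodes its stored history plus the current utterance."""
    return sum(min(t, memory_size) + 1 for t in range(num_turns))


def recurrent_encoder_passes(num_turns):
    """Passes if each turn is encoded once and the history lives in the RNN state."""
    return num_turns
```

Under this model, a 10-turn dialogue with a memory size of 20 needs 55 passes for the memory-based approach and 10 for the recurrent one, so the gap grows quadratically with dialogue length until the memory size caps it.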

Intent and Dialogue Act Classification

The user intent helps to identify the APIs/databases which the dialogue system should interact with. Intents are predicted at each turn so that a change of intent during the dialogue can be detected.

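The paper assumes that each user utterance contains exactly one intent and, at each turn, predicts a probability distribution over the set of all intents (the equations are not included). Dialogue act classification, in contrast, is treated as a multi-label classification task: a single user utterance can carry several dialogue act labels.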

Definitions:

$p_{i}^{t}$: intent probability distribution, of length $\left|I \right|$
$p_{a}^{t}(k)$: probability of presence of dialogue act k in turn t
$I$: user intent set
$A_{u}$: dialogue act set
$W_{i}\in R^{d \times \left|I \right|}, W_{a}\in R^{d \times \left|A_{u} \right|}$, where $d$ is the length of $o^{t}$

During inference, we predict $argmax(p_{i}^{t})$ as the intent label and all dialogue acts with probability greater than $t_{u}$ are associated with the utterance, where 0 < $t_{u}$ < 1.0 is a hyperparameter tuned using the validation set.
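The inference rule above can be sketched as follows. The sketch assumes the logits are linear projections of $o^{t}$ by $W_{i}$ and $W_{a}$ (inferred from the matrix shapes, since the equations are not shown), with a softmax for the single-label intent and an element-wise sigmoid for the multi-label acts. The function name and the label names in the usage note are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(intent_logits, act_logits, intents, acts, threshold=0.5):
    """Return the argmax intent and every act whose probability exceeds threshold."""
    p_intent = softmax(intent_logits)
    best = max(range(len(intents)), key=p_intent.__getitem__)
    p_acts = [sigmoid(x) for x in act_logits]
    chosen = [a for a, p in zip(acts, p_acts) if p > threshold]
    return intents[best], chosen
```

For example, `predict([2.0, 0.1, -1.0], [3.0, -2.0, 0.2], ["BUY_TICKETS", "RESERVE", "OTHER"], ["inform", "negate", "affirm"])` selects the first intent and keeps `inform` and `affirm`, because only those two act logits are positive and therefore map to probabilities above 0.5.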

Slot Tagging

Slot tagging is the task of identifying the values for different slots present in the user utterance.

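The Slot Tagger is a Bi-LSTM whose input is the token-level output of the Utterance Encoder. It produces $s_{o}^{t}=\left \{ s_{o,m}^{t}\in R^{2d_{s}},0\leq m< M^{t} \right \}$, where $M^{t}$ is the length of the user's token sequence. For the m-th token, a softmax over $s_{o,m}^{t}$ yields a probability distribution over $2\left| S\right|+1$ labels, where S is the set of all slots.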
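The label count $2\left| S\right|+1$ is what an IOB-style tagging scheme produces: a begin tag and an inside tag for every slot, plus one outside tag. The sketch below assumes that scheme; the tag spellings, slot names, and function names are hypothetical.

```python
def build_tag_set(slots):
    """One O tag plus a B- and an I- tag per slot: 2|S| + 1 labels in total."""
    tags = ["O"]
    for slot in slots:
        tags += ["B-" + slot, "I-" + slot]
    return tags


def extract_slots(tokens, tags):
    """Collect (slot, value) pairs from an IOB-tagged token sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [tag[2:], [token]]
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            current = None  # an O tag or a stray I- tag closes the open span
    return [(slot, " ".join(words)) for slot, words in spans]
```

With three slots the tag set has seven labels, and tagging "book 2 tickets for 6 pm" as `O B-num_tickets O O B-time I-time` yields the pairs `("num_tickets", "2")` and `("time", "6 pm")`.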

We use an LSTM cell instead of a GRU because it gave better results on the validation set.

Experiments

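In this paper, a turn consists of one exchange between system and user: the system speaks first, and the user then replies. The dialogue context encoding used here actually comprises two kinds of information: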

the dialogue encoding vector $o^{t-1}$ encodes all turns prior to the current turn
the system intent vector $a^{t}$ encodes the current turn system utterance

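Thus, once the system has spoken, $o^{t-1}$ and $a^{t}$ together encode the entire dialogue history. These vectors can be fed as additional inputs to several components of the SLU model (the positions are labeled A to D in the paper's Figure 2, which is not included):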

Positions A and C feed context vectors as additional inputs at each RNN step whereas positions B and D use the context vectors to initialize the hidden state of the two RNNs after a linear projection to the hidden state dimension.

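The experimental configurations are as follows:

$a^{t}$ only, no dialogue encoder: $a^{t}$ is fed as an additional input at one of positions A to D, the dialogue encoder is removed, and $u^{t}$ replaces $o^{t}$ for intent and act classification. In this configuration, adding $a^{t}$ at position B gave the best results on the validation set; the test-set results are in row 7 of Table 1.
$a^{t}$ only: $a^{t}$ is fed to the dialogue encoder and is also added as an additional input at one of positions A to D. Row 8 of Table 1 shows the best model for this configuration, with $a^{t}$ added at position D.
$o^{t-1}$ only: $a^{t}$ is fed to the dialogue encoder, and $o^{t-1}$ is added as an additional input at position C or D. Row 9 of Table 1 shows the best model for this configuration, with $o^{t-1}$ added at position D.
$a^{t}$ and $o^{t-1}$: $a^{t}$ is fed to the dialogue encoder, and $o^{t-1}$ and $a^{t}$ are each independently added at position C or D, giving four combinations. Row 10 of Table 1 shows the best model for this configuration, with $o^{t-1}$ at position D and $a^{t}$ at position C.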

Dataset

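The experiments use Google's dialogue dataset, which has 12 slot types and 21 user dialogue act types. A major challenge of this dataset is the large number of unseen entities.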

For instance, only 13% of the movie names in the validation and test sets are also present in the training set.

Baselines

The following four models are used as baselines:

NoContext: A two-layer stacked bidirectional RNN using GRU and LSTM cells respectively, and no context.
PrevTurn: This is similar to the NoContext model, with a different bidirectional GRU layer encoding the previous system turn, and this encoding being input to the slot tagging layer of the encoder, i.e. position C in Figure 2.
MemNet: This is the system from [11], using cosine attention. For this model, we report metrics with models trained with memory sizes of 6 and 20 turns. A memory size of 20, while making the model slower, enables it to use the entire dialogue history for most of the dialogues.
SDEN: This is the system from [12] which uses a bidirectional GRU RNN for combining memory embeddings. We report metrics for models with memory sizes 6 and 20.

Training and Evaluation

We use sigmoid cross entropy loss for dialogue act classification (since it is modeled as a multilabel binary classification problem) and softmax cross entropy loss for intent classification and slot tagging. During training, we minimize the sum of the three constituent losses using the ADAM optimizer [25] for 150k training steps with a batch size of 10 dialogues.
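The combined objective can be sketched for a single user utterance as below. It assumes an unweighted sum and a slot loss summed over tokens; the text does not say whether per-token losses are summed or averaged, and the function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """-log softmax(logits)[target_index], computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def sigmoid_cross_entropy(logits, targets):
    """Sum of binary cross entropies, one per dialogue act label.

    Uses the stable form max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    return sum(max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
               for x, y in zip(logits, targets))


def joint_loss(intent_logits, intent_target, act_logits, act_targets,
               tag_logits, tag_targets):
    """Intent loss + dialogue act loss + slot tagging loss (summed over tokens)."""
    slot_loss = sum(softmax_cross_entropy(logits, target)
                    for logits, target in zip(tag_logits, tag_targets))
    return (softmax_cross_entropy(intent_logits, intent_target)
            + sigmoid_cross_entropy(act_logits, act_targets)
            + slot_loss)
```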

To improve model performance in the presence of out of vocabulary (OOV) tokens arising from entities not present in the training set, we randomly replace tokens corresponding to slot values in user utterance with a special OOV token with a value dropout probability that linearly increases during training.
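A minimal sketch of this value dropout follows. It assumes the probability rises linearly from 0 at step 0 to the maximum at the final training step, and that each slot-value token is dropped independently; neither detail is stated above. The `<OOV>` token string and the function names are hypothetical.

```python
import random


def value_dropout_prob(step, total_steps, max_prob):
    """Linearly increasing dropout probability, capped at max_prob."""
    return max_prob * min(step, total_steps) / total_steps


def apply_value_dropout(tokens, tags, prob, rng, oov="<OOV>"):
    """Replace tokens that belong to a slot value (any tag other than O) with oov."""
    return [oov if tag != "O" and rng.random() < prob else token
            for token, tag in zip(tokens, tags)]
```

Tokens tagged `O` are never replaced, so the model still sees the surrounding carrier phrase and has to rely on it, rather than on memorized entity names, to tag the masked value.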

To find the best hyperparameter values, we perform grid search over the token embedding size {64, 128, 256}, learning rate [0.0001, 0.01], maximum value dropout probability [0.2, 0.5] and the intent prediction threshold {0.3, 0.4, 0.5}, for each model configuration listed in Section 3. The utterance encoder and slot tagger layer sizes are set equal to the token embedding dimension, and that of the dialogue encoder to half this dimension. In Table 1, we report intent accuracy, dialogue act F1 score, slot chunk F1 score [22] and frame accuracy on the test set for the best runs for each configuration in Section 3 based on frame accuracy on the combined validation set, to avoid overfitting.

A frame is considered correct if its predicted intent, slots and acts are all correct.
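Frame accuracy under this definition can be computed as in the sketch below. The frame layout (a dict with `intent`, `acts`, and `slots` keys) and the function names are hypothetical choices made for illustration.

```python
def frame_correct(pred, gold):
    """A frame counts only if the intent, the act set, and all slots match exactly."""
    return (pred["intent"] == gold["intent"]
            and set(pred["acts"]) == set(gold["acts"])
            and pred["slots"] == gold["slots"])


def frame_accuracy(preds, golds):
    """Fraction of turns whose predicted frame is entirely correct."""
    if not golds:
        return 0.0
    return sum(frame_correct(p, g) for p, g in zip(preds, golds)) / len(golds)
```

This is a strict metric: a turn with the right intent and acts but one wrong slot value contributes nothing, which is why frame accuracy is typically lower than each of the component scores.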

Results and Discussion

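The proposed model matches the accuracy of the MemNet and SDEN baselines, and all of them far outperform the no-context model, which demonstrates the importance of context for SLU.
Another consideration is computational efficiency: a memory network must process many utterances from the dialogue history at every turn, whereas the proposed model obtains the context representation with only a feedforward fully connected network and one RNN step. SDEN is slower than the memory network because it additionally passes the memory network's output embeddings through an RNN.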

Empirically, MemNet-6 and MemNet-20 experiments took roughly 4x and 12x more
The proposed model also generalizes better on the smaller dataset (Sim-M).

Two interesting experiments to compare are rows 2 and 7 i.e. “PrevTurn” and “$a^{t}$ only, No DE”; they both use context only from the previous system utterance/acts, discarding the remaining turns. Our system act encoder, comprising only a two-layer feedforward network, is in principle faster than the bidirectional GRU that “PrevTurn” uses to encode the system utterance. This notwithstanding, the similar performance of both models suggests that using system dialogue acts for context is a good alternative to using the corresponding system utterance.

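Table 1 also shows the best input positions for $a^{t}$ and $o^{t-1}$. Overall, using them as the initial state of the RNN cell (B, D) works better than concatenating them to the input at each step (A, C). The authors suggest this may be because $a^{t}$ and $o^{t-1}$ are the same for every user token, which makes the per-step input redundant.
For slot tagging accuracy, using $o^{t-1}$ brought no improvement over $a^{t}$. This suggests a strong correlation between the slots in the system acts and the slots mentioned in the user's reply: the user usually responds directly to the preceding system act, with little dependence on earlier turns.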

Conclusion

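This paper proposes a fast and effective SLU model for encoding dialogue context that avoids the inefficient computation of memory networks without sacrificing accuracy. The approach can also be applied to other components of a dialogue system, such as state tracking.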

Helic He
